What is the best way to reduce the dimensionality of ranking data? Is clustering sufficient or can we do better and recreate the latent ideology space that political rankings come from?
These questions first cropped up in my work on racial polarization in ranked choice voting elections. Racial polarization metrics for plurality elections typically compare the race of respondents (or aggregated respondents) to their choice for a political office, but since ranked choice elections produce more than one choice, creating a comparable metric becomes difficult. This is particularly true because theories of ranked choice elections predict that they depolarize elections, but the mechanism by which this occurs is only relevant to choices lower on a voters list. In order to test these theories, we need a metric that condenses ranked choices into 1 variable while preserving what makes the rankings interesting. Clustering is the obvious solution, but given the desirability of a continuous variable and the success of ideal point estimation applications in similar settings, ideal point generating processes such as PCA are also tempting.
This project interogates these questions using simulated preference orderings to measure the accuracy of knn and pca models in R in replicating a set of clusters and the space of these clusters respectively, then applies what I learn to an actual ranked choice election: the 2020 Democratic primary, to see what the best techniques can tell us.
set.seed(27680)
vote_pref_dims1 <- data.frame()
## generate voters
voters<-rnorm(n=1000,mean=0,sd=.2)
## generate candidates
candidates <- seq(-1,1, by = .2) + rnorm(n = 11, mean = 0, sd = .4)
## produce rankings
for(val in voters){
vote_pref_dims1<-rbind(vote_pref_dims1,rank(abs(candidates-val))) }
## scale rankings
sim_comp_dims1<-prcomp(vote_pref_dims1, scale =TRUE)
sim_vs_pr <- data.frame(cbind(voters, sim_comp_dims1$x)) %>%
dplyr::select(voters, PC1) %>%
mutate(voters = scales::rescale(voters, to = c(-1, 1))) %>%
mutate(PC1 = scales::rescale(PC1, to = c(-1, 1)))
## visualize rankings
ggplot(as.data.frame(sim_vs_pr), aes(x = PC1))+
geom_density(fill = "lightblue") + labs(x = "preference ideal point") +
geom_density(aes(x = voters), fill = "#FF6666", alpha = .2) +
theme_classic()
## Linear Regression
reg1<- lm(voters ~ PC1, data = sim_vs_pr)
stargazer(reg1, header = FALSE, type = 'html')
| Dependent variable: | |
| voters | |
| PC1 | 0.640*** |
| (0.004) | |
| Constant | 0.199*** |
| (0.002) | |
| Observations | 1,000 |
| R2 | 0.962 |
| Adjusted R2 | 0.962 |
| Residual Std. Error | 0.055 (df = 998) |
| F Statistic | 25,420.310*** (df = 1; 998) |
| Note: | p<0.1; p<0.05; p<0.01 |
As the graph shows, even in the most simple case, PCA is an imperfect measure of the ideological space the rankings are derived from. Still, as the regression table shows, even if PCA is bad at predicting the structure of the ideological space, its an extremely good predictor of the positioning of individual datapoints within the space. While the correlation is far from perfect, it is very strong, and as the \(R^2\) shows, it explains almost all of the variance in the latent ideology variable.
## Two clusters
vote_pref_dims1 <- data.frame()
## generate voters
# assign clusters
cluster <- rep(1:2, times = 500)
voters <- data.frame(cluster)
voters$x <- NA
# assign position in the space based on clusters
voters[cluster == 1,]$x<-rnorm(n = 500, mean = -.5, sd = .2)
voters[cluster == 2,]$x<-rnorm(n = 500, mean = .5, sd = .2)
## generate candidates
candidates <- seq(-1,1, by = .2) + rnorm(n = 11, mean = 0, sd = .4)
## produce rankings
for( i in 1:length(voters$x)){
vote_pref_dims1<-rbind(vote_pref_dims1,rank(abs(candidates-voters$x[i]))) }
## scale rankings
sim_comp_dims1<-prcomp(vote_pref_dims1, scale =TRUE)
sim_vs_pr <- data.frame(cbind(voters, sim_comp_dims1$x)) %>%
dplyr::select(cluster, x , PC1) %>%
mutate(x = scales::rescale(x, to = c(-1, 1))) %>%
mutate(PC1 = scales::rescale(PC1, to = c(-1, 1)))
## visualize rankings
ggplot(as.data.frame(sim_vs_pr), aes(x = PC1))+
geom_density(fill = "lightblue") + labs(x = "preference ideal point") +
geom_density(aes(x = x), fill = "#FF6666", alpha = .2) +
theme_classic()
## Linear Regression
reg1<- lm(x ~ PC1, data = sim_vs_pr)
stargazer(reg1, header = FALSE, type = 'html')
| Dependent variable: | |
| x | |
| PC1 | 0.557*** |
| (0.004) | |
| Constant | 0.015*** |
| (0.004) | |
| Observations | 1,000 |
| R2 | 0.941 |
| Adjusted R2 | 0.941 |
| Residual Std. Error | 0.121 (df = 998) |
| F Statistic | 15,885.130*** (df = 1; 998) |
| Note: | p<0.1; p<0.05; p<0.01 |
Adding clusters causes the PCA algorithm to do a much better job of reconstructing the structure of the latent ideology space. On the other hand, the regression shows that with clusters, PCA does no better and may actually do slightly worse at predicting the positioning of specific data points than without it. What about the estimating the clusters themselves?
## KNN
set.seed(123570)
one_dim_two_clust_kmeans <-kmeans(vote_pref_dims1, centers = 2,# number of clusters
nstart = 100)# number of random starts
sim_actual_clust <- cbind(one_dim_two_clust_kmeans$cluster, voters) %>%
mutate(simulated_cluster = factor(one_dim_two_clust_kmeans$cluster, labels = c("Cluster 1", "Cluster 2"))) %>%
mutate(cluster = factor(cluster, labels = c("Cluster 1", "Cluster 2")))
conf_mat(data = sim_actual_clust,
truth = cluster,
estimate = simulated_cluster )
## Truth
## Prediction Cluster 1 Cluster 2
## Cluster 1 495 6
## Cluster 2 5 494
As the confidence matrix above shows, the kmeans clustering algorithm predicts the two clusters within the data with around 99% accuracy. This makes sense given what we know so far: predicting the structure of the space is difficult but predicting the attributes of individual points is easier.
With two clusters, the clustering algorithm is much more accurate than the PCA. What about three clusters?
set.seed(129570)
## Three clusters
vote_pref_dims1 <- data.frame()
# assign cluster
cluster <- rep(1:3, times = 333)
voters <- data.frame(cluster)
voters$x <- NA
# assign position in space based on cluster
voters[cluster == 1,]$x<-rnorm(n = 333, mean = -1, sd = .2)
voters[cluster == 2,]$x<-rnorm(n = 333, mean = 1, sd = .2)
voters[cluster == 3,]$x<-rnorm(n = 333, mean = 0, sd = .2)
## generate candidates
candidates <- seq(-1,1, by = .2) + rnorm(n = 11, mean = 0, sd = .4)
## produce rankings
for( i in 1:length(voters$x)){
vote_pref_dims1<-rbind(vote_pref_dims1,rank(abs(candidates-voters$x[i]))) }
## scale rankings
sim_comp_dims1<-prcomp(vote_pref_dims1, scale =TRUE)
sim_vs_pr <- data.frame(cbind(voters, sim_comp_dims1$x)) %>%
dplyr::select(cluster, x , PC1) %>%
mutate(x = scales::rescale(x, to = c(-1, 1))) %>%
mutate(PC1 = scales::rescale(PC1, to = c(-1, 1)))
## visualize rankings
ggplot(as.data.frame(sim_vs_pr), aes(x = PC1))+
geom_density(fill = "lightblue") + labs(x = "preference ideal point") +
geom_density(aes(x = x), fill = "#FF6666", alpha = .2) +
theme_classic()
## Linear Regression
reg1<- lm(x ~ PC1, data = sim_vs_pr)
stargazer(reg1, header = FALSE, type = 'html')
| Dependent variable: | |
| x | |
| PC1 | 0.615*** |
| (0.005) | |
| Constant | 0.004 |
| (0.004) | |
| Observations | 999 |
| R2 | 0.945 |
| Adjusted R2 | 0.945 |
| Residual Std. Error | 0.120 (df = 997) |
| F Statistic | 17,276.580*** (df = 1; 997) |
| Note: | p<0.1; p<0.05; p<0.01 |
As the density plot shows PCA barely captures the middle cluster, but is able to replicate at least some of the shape of the data. As the regression table shows, PCA is still doing very well at identifying the positioning of individual points.
set.seed(149870)
## KNN
one_dim_three_clust_kmeans <-kmeans(vote_pref_dims1, centers = 3,# number of clusters
nstart = 100)# number of random starts
sim_actual_clust <- cbind(one_dim_three_clust_kmeans$cluster, voters) %>%
mutate(simulated_cluster = factor(one_dim_three_clust_kmeans$cluster, labels = c("Cluster 1", "Cluster 2", "Cluster 3"))) %>%
mutate(cluster = factor(cluster, labels = c("Cluster 1", "Cluster 2", "Cluster 3")))
conf_mat(data = sim_actual_clust,
truth = cluster,
estimate = simulated_cluster )
## Truth
## Prediction Cluster 1 Cluster 2 Cluster 3
## Cluster 1 333 0 7
## Cluster 2 0 323 1
## Cluster 3 0 10 325
Clustering with KNN is highly accurate with three clusters as well. There is a little more bleed between clusters which makes sense given that the clusters themselves have greater overlap, but overall KNN continues to outperform PCA.
set.seed(094334634)
## generate voters
# assign cluster
cluster <- rep(1:3, times = 333)
voters <- data.frame(cluster)
voters$x <- NA
voters$y <- NA
# assign positioning in the space based on cluster in two dimensions
voters[cluster == 1,]$x<-rnorm(n = 333, mean = -1, sd = .5)
voters[cluster == 2,]$x<-rnorm(n = 333, mean = 1, sd = .5)
voters[cluster == 3,]$x<-rnorm(n = 333, mean = 0, sd = .5)
voters[cluster == 1,]$y<-rnorm(n = 333, mean = .6, sd = .5)
voters[cluster == 2,]$y<-rnorm(n = 333, mean = .6, sd = .5)
voters[cluster == 3,]$y<-rnorm(n = 333, mean = -.6, sd = .5)
voters <- voters %>%
mutate(cluster = factor(cluster, labels = c("Cluster 1", "Cluster 2", "Cluster 3")))
## generate candidates
a <-seq(-1.5,1.5, by = .4) + rnorm(n = 11, mean = 0, sd = .5)
## Warning in seq(-1.5, 1.5, by = 0.4) + rnorm(n = 11, mean = 0, sd = 0.5): longer
## object length is not a multiple of shorter object length
b <- rnorm(n = 11, mean = 0, sd = .5)
candidates <- data.frame(a, b)
## Jointly plot voters and candidates
ggplot() +
geom_point(aes(x = x, y = y, color = cluster), data = voters) +
geom_point(aes(x= a, y = b), color = "black", data = candidates)
Simulating data in two dimensions is a little trickier and requires a more sophisticated approach to euclidean distance, but operates on broadly the same principal. For the purposes of two dimensional analysis, I have replaced the density plots with a scatter plot, and plotted the candidate positioning as black dots within the same space.
## produce rankings
vote_pref<-data.frame()
for(i in 1:length(voters$x)){
vote_pref<-rbind(vote_pref,rank(pointDistance(candidates, voters[i,2:3], lonlat = FALSE))) }
## scale rankings
sim_comp<-prcomp(vote_pref, scale =TRUE)
voters_sim<- cbind(voters,sim_comp$x)
voters_scaled <- voters_sim %>%
dplyr::select(x ,y, PC1, PC2) %>%
mutate(x = scales::rescale(x, to = c(-1, 1))) %>%
mutate(y = scales::rescale(y, to = c(-1, 1))) %>%
mutate(PC1 = scales::rescale(PC1, to = c(-1, 1))) %>%
mutate(PC2 = scales::rescale(PC2, to = c(-1, 1)))
reg2 <- lm(x ~ PC1 + PC2, data = voters_scaled)
reg3 <- lm(y ~ PC1 + PC2, data = voters_scaled)
stargazer(reg2, reg3, header = FALSE, type = 'html')
| Dependent variable: | ||
| x | y | |
| (1) | (2) | |
| PC1 | 0.477*** | -0.099*** |
| (0.005) | (0.007) | |
| PC2 | -0.147*** | -0.632*** |
| (0.008) | (0.010) | |
| Constant | -0.039*** | 0.028*** |
| (0.004) | (0.005) | |
| Observations | 999 | 999 |
| R2 | 0.902 | 0.809 |
| Adjusted R2 | 0.902 | 0.809 |
| Residual Std. Error (df = 996) | 0.125 | 0.160 |
| F Statistic (df = 2; 996) | 4,584.324*** | 2,112.973*** |
| Note: | p<0.1; p<0.05; p<0.01 | |
ggplot(as.data.frame(voters_scaled), aes(x = PC1, y = PC2))+
geom_point(color = "lightblue") + labs(x = "preference ideal point x", y= "preference ideal point y") +
geom_point(aes(x = x, y = y), data = voters_scaled, color = "red", alpha = .2) +
theme_classic()
How does PCA perform when estimating two dimensions? The scatter plot suggests that again, the structure of the data is very imperfectly reconstructed. In particular, the circular nature of the PCA space relative to the triangular shape of the latent space is a common symptom of overfitting with ideal point variables. Distance is easiest to equally distribute around a circle, so ideal points that are imperfectly fitted frequently converge on a circular distribution. Still, while the regression shows some ‘bleed’ between the two dimensions, it demonstrates that the PC1 and PC2 variables still explain the vast majority of the variance in x and y.
## KNN
set.seed(1916934)
two_dim_three_clust_kmeans <-kmeans(vote_pref, centers = 3,# number of clusters
nstart = 100)# number of random starts
sim_actual_clust <- cbind(two_dim_three_clust_kmeans$cluster, voters) %>%
mutate(simulated_cluster = factor(two_dim_three_clust_kmeans$cluster, labels = c("Cluster 1", "Cluster 2", "Cluster 3")))
conf_mat(data = sim_actual_clust,
truth = cluster,
estimate = simulated_cluster )
## Truth
## Prediction Cluster 1 Cluster 2 Cluster 3
## Cluster 1 317 6 33
## Cluster 2 14 8 212
## Cluster 3 2 319 88
ggplot(as.data.frame(voters_scaled), aes(x = PC1, y = PC2))+
geom_point(aes(color = sim_actual_clust$simulated_cluster)) + labs(x = "preference ideal point x", y= "preference ideal point y") +
geom_point(aes(x = x, y = y, color = sim_actual_clust$cluster), data = voters_scaled, alpha = .2) +
theme_classic()
In two dimensions, KNN also has a lot more difficulty, but still does better than PCA. Cluster misidentification in two dimensions also reflects the way I have constructed the two dimensional space, which as you can see in the original cluster scatter plot, features a fair number of dots that are closer to another cluster than the cluster they were generated from.
In two dimensions,
scale1<-prefscal(vote_pref,ndim=2,type="ordinal",conditionality="matrix",lambda = .5,omega = .1)
plot(scale1)
unfold <- as.data.frame(scale1$conf.row)
voters_scaled <- voters_scaled %>%
mutate(unfold_x = scales::rescale(unfold$D1, to = c(-1, 1))) %>%
mutate(unfold_y = scales::rescale(unfold$D2, to = c(-1, 1)))
reg4 <- lm(x ~ unfold_x + unfold_y, data = voters_scaled)
reg5 <- lm(y ~ unfold_x + unfold_y, data = voters_scaled)
stargazer(reg2, reg3, reg4, reg5, header = FALSE, type = 'html')
| Dependent variable: | ||||
| x | y | x | y | |
| (1) | (2) | (3) | (4) | |
| PC1 | 0.477*** | -0.099*** | ||
| (0.005) | (0.007) | |||
| PC2 | -0.147*** | -0.632*** | ||
| (0.008) | (0.010) | |||
| unfold_x | 0.560*** | -0.055*** | ||
| (0.005) | (0.008) | |||
| unfold_y | -0.023*** | 0.590*** | ||
| (0.006) | (0.010) | |||
| Constant | -0.039*** | 0.028*** | -0.020*** | -0.014** |
| (0.004) | (0.005) | (0.003) | (0.006) | |
| Observations | 999 | 999 | 999 | 999 |
| R2 | 0.902 | 0.809 | 0.933 | 0.768 |
| Adjusted R2 | 0.902 | 0.809 | 0.933 | 0.768 |
| Residual Std. Error (df = 996) | 0.125 | 0.160 | 0.103 | 0.177 |
| F Statistic (df = 2; 996) | 4,584.324*** | 2,112.973*** | 6,960.213*** | 1,651.744*** |
| Note: | p<0.1; p<0.05; p<0.01 | |||
Are other ideal point estimation algorithms more accurate in recreated the initial space than PCA? To test this I use the function prefscal from the smacof package, this function is specifically designed to recover a two dimensional space from ranking data (sometimes referred to in data science as unfolding). However, as the scatter plot shows, prefscal does not do much visually better than PCA at recreating the latent space and suffers from the same circular overfitting problem. The smacof package features its own scatterplot type show above the gg scater plot which jointly projects candidate positioning and voter positioning, something PCA does not. Here we can see another interesting property of these algorithms, which is that they tend to position candidates outside of the voter space. Turning to the regression, we can see that while prefscal does a little better at predicting the underlying ideal point positions than PCA, the differences are marginal at best.
One theory I had running these simulations over and over again and seeing how much variance there was in the outcomes was that the accuracy of the clustering and PCA algorithms might be highly sensitive to the positioning of candidates in the space. There were much fewer candidates in my simulation than voters, but candidates had roughly the same level of variance because they needed to be evenly distributed accross the space. I reasoned that candidates could wind up randomly positioned in ways that skewed voter rankings or made them less predictive. If a candidate was randomly generated in a position where voters would find them universally undesirable, this would rob the simulation of 1/11th of its predictive power. To test whether candidate positioning was the cause of the ‘noisiness’ of the results, I reran the two dimensional simulation with 100 candidates spread widely accross the space.
## Generate a huge and divserse pool of candidates:
a <- rnorm(n = 100, mean = 0, sd = 1)
b <- rnorm(n = 100, mean = 0, sd = 1)
candidates <- data.frame(a, b)
ggplot() +
geom_point(aes(x = x, y = y, color = cluster), data = voters) +
geom_point(aes(x= a, y = b), color = "black", data = candidates)
## produce rankings
vote_pref<-data.frame()
for(i in 1:length(voters$x)){
vote_pref<-rbind(vote_pref,rank(pointDistance(candidates, voters[i,2:3], lonlat = FALSE))) }
## scale rankings
sim_comp<-prcomp(vote_pref, scale =TRUE)
voters_sim<- cbind(voters,sim_comp$x)
voters_scaled <- voters_sim %>%
dplyr::select(x ,y, PC1, PC2) %>%
mutate(x = scales::rescale(x, to = c(-1, 1))) %>%
mutate(y = scales::rescale(y, to = c(-1, 1))) %>%
mutate(PC1 = scales::rescale(PC1, to = c(-1, 1))) %>%
mutate(PC2 = scales::rescale(PC2, to = c(-1, 1)))
reg2 <- lm(x ~ PC1 + PC2, data = voters_scaled)
reg3 <- lm(y ~ PC1 + PC2, data = voters_scaled)
stargazer(reg2, reg3, header = FALSE, type = 'html')
| Dependent variable: | ||
| x | y | |
| (1) | (2) | |
| PC1 | 0.568*** | 0.113*** |
| (0.004) | (0.004) | |
| PC2 | 0.179*** | -0.648*** |
| (0.005) | (0.005) | |
| Constant | 0.071*** | 0.101*** |
| (0.003) | (0.003) | |
| Observations | 999 | 999 |
| R2 | 0.946 | 0.951 |
| Adjusted R2 | 0.946 | 0.951 |
| Residual Std. Error (df = 996) | 0.093 | 0.081 |
| F Statistic (df = 2; 996) | 8,700.073*** | 9,752.193*** |
| Note: | p<0.1; p<0.05; p<0.01 | |
As the regression table above shows, the gains were far more marginal than I would have hoped had my theory been correct. While the PCA algorithm is somewhat more accurate than it is with fewer candidates the difference should be far greater given that I used 10x as many candidates if candidate positioning was the driver of the noisiness of the data.
set.seed(91395979)
## KNN
two_dim_three_clust_kmeans <-kmeans(vote_pref, centers = 3,# number of clusters
nstart = 100)# number of random starts
sim_actual_clust <- cbind(two_dim_three_clust_kmeans$cluster, voters) %>%
mutate(simulated_cluster = factor(two_dim_three_clust_kmeans$cluster, labels = c("Cluster 1", "Cluster 2", "Cluster 3")))
conf_mat(data = sim_actual_clust,
truth = cluster,
estimate = simulated_cluster )
## Truth
## Prediction Cluster 1 Cluster 2 Cluster 3
## Cluster 1 3 311 23
## Cluster 2 313 6 16
## Cluster 3 17 16 294
ggplot(as.data.frame(voters_scaled), aes(x = PC1, y = PC2))+
geom_point(aes(color = sim_actual_clust$simulated_cluster)) + labs(x = "preference ideal point x", y= "preference ideal point y") +
geom_point(aes(x = x, y = y, color = sim_actual_clust$cluster), data = voters_scaled, alpha = .2) +
theme_classic()
The gains from increasing the number of candidates for clustering are more significant, the clustering algorithm cuts its error rate roughly in half. I also tested whether prefscal performed better with more candidates.
scale2<-prefscal(vote_pref,ndim=2,type="ordinal",conditionality="matrix",lambda = .5,omega = .1)
plot(scale2)
unfold <- as.data.frame(scale2$conf.row)
voters_scaled <- voters_scaled %>%
mutate(unfold_x = scales::rescale(unfold$D1, to = c(-1, 1))) %>%
mutate(unfold_y = scales::rescale(unfold$D2, to = c(-1, 1)))
reg4 <- lm(x ~ unfold_x + unfold_y, data = voters_scaled)
reg5 <- lm(y ~ unfold_x + unfold_y, data = voters_scaled)
stargazer(reg2, reg3, reg4, reg5, header = FALSE, type = 'html')
| Dependent variable: | ||||
| x | y | x | y | |
| (1) | (2) | (3) | (4) | |
| PC1 | 0.568*** | 0.113*** | ||
| (0.004) | (0.004) | |||
| PC2 | 0.179*** | -0.648*** | ||
| (0.005) | (0.005) | |||
| unfold_x | 0.607*** | 0.117*** | ||
| (0.004) | (0.004) | |||
| unfold_y | 0.145*** | -0.657*** | ||
| (0.005) | (0.004) | |||
| Constant | 0.071*** | 0.101*** | 0.055*** | 0.102*** |
| (0.003) | (0.003) | (0.003) | (0.002) | |
| Observations | 999 | 999 | 999 | 999 |
| R2 | 0.946 | 0.951 | 0.958 | 0.957 |
| Adjusted R2 | 0.946 | 0.951 | 0.958 | 0.957 |
| Residual Std. Error (df = 996) | 0.093 | 0.081 | 0.082 | 0.076 |
| F Statistic (df = 2; 996) | 8,700.073*** | 9,752.193*** | 11,296.350*** | 11,118.850*** |
| Note: | p<0.1; p<0.05; p<0.01 | |||
ggplot(as.data.frame(voters_scaled), aes(x = unfold_x, y = unfold_y))+
geom_point(aes(color = sim_actual_clust$simulated_cluster)) + labs(x = "preference ideal point x", y= "preference ideal point y") +
geom_point(aes(x = x, y = y, color = sim_actual_clust$cluster), data = voters_scaled, alpha = .2) +
theme_classic()
The results are very simmilar to my earlier finding that using prefscal or prcomp makes very little difference.
One direction Aaron thought might be worth pursuing was making the simulations more accurate by randomly eliminating preference information given the nature of the actual ballot data I have collected, which features many missing values. As shown below, redacting a significant portion of the preference information does not significantly damage the accuracy of the clustering algorithm.
vote_pref<-data.frame()
a <-seq(-1.5,1.5, by = .4) + rnorm(n = 11, mean = 0, sd = .5)
## Warning in seq(-1.5, 1.5, by = 0.4) + rnorm(n = 11, mean = 0, sd = 0.5): longer
## object length is not a multiple of shorter object length
b <- rnorm(n = 11, mean = 0, sd = .5)
candidates <- data.frame(a, b)
for(i in 1:length(voters$x)){
vote_pref<-rbind(vote_pref,rank(pointDistance(candidates, voters[i,2:3], lonlat = FALSE))) }
redact<- round(abs(rnorm(999, mean = 3, sd = 2)),0)
summary(redact)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 2.0 3.0 3.1 4.0 9.0
for (i in 1:length(vote_pref$X2)){
count <- 0
for (h in vote_pref[i,]){
if(count<redact[i]){
vote_pref[i,h] <- 12
}
count <- count+1
}
}
two_dim_three_clust_kmeans <-kmeans(vote_pref, centers = 3,# number of clusters
nstart = 100)# number of random starts
sim_actual_clust <- cbind(two_dim_three_clust_kmeans$cluster, voters) %>%
mutate(simulated_cluster = factor(one_dim_three_clust_kmeans$cluster, labels = c("Cluster 1", "Cluster 2", "Cluster 3")))
conf_mat(data = sim_actual_clust,
truth = cluster,
estimate = simulated_cluster )
## Truth
## Prediction Cluster 1 Cluster 2 Cluster 3
## Cluster 1 333 0 7
## Cluster 2 0 323 1
## Cluster 3 0 10 325
Overall, my experience with simulating ranked data and ideological spaces suggests three conclusions. 1) Clustering algorithms can identify latent clusters within the data with decent accuracy even given significant overlap and preference redaction. 2) PCA can identify the positioning of individual voters well but does perform well at reconstructing the underlying ideal point space, because of this, there is not a great deal of benefit of using PCA rather than clustering. 3) Basic PCA performs about as well as more specialized tools at performing these tasks. prefscal which estimates the position of both voters and candidates does about as poorly with candidates as it does with voters which is about as poor as PCA.
These findings lead me to the conclusion that I should focus on looking for underlying clusters in my ballot data, and that using PCA rather than prefscal (which is much more computationally intensive) was acceptable going forward.
Data for this project comes from three Democratic primaries that occured during the Summer of 2020 in which ranked choice voting was used. These primaries took place after Biden had consolidated the field and had more or less already won, so competition was less intense than it would have been in earlier primaries. Still, my hope was that this data would provide an interesting test case for the ideology space methods I have experimented above and an interesting question, what were the latent coalitions in the Democratic primary? Which candidates did voters see as similar to one another?
# Pull in the data
wyoming <- read_csv("~/Documents/Wyoming data import.csv")
AlaskaCVR <- read_excel("~/Documents/AlaskaCVR.xlsx", na = "under")
KansasCVR <- read_excel("KansasCVR.xlsx")
# relabel columns to enable pivoting
names<-colnames(wyoming)
names[6:10]<- c(1,2,3,4,5)
colnames(wyoming) <- names
# clean and relabel the candidates
wyoming_cleaned <- wyoming %>%
dplyr::select(BallotID, `1` ,`2` ,`3` ,`4`, `5` ) %>%
group_by(BallotID) %>%
mutate_all(funs(recode(., '1' = "Biden", '3' = "Steyer",
'4' = "Sanders", '5' = "Klobuchar",
'7' = "Gabbard", '8' = "Warren",
'9' = "Bloomberg", '11' = "Buttgieg",
'12' = "undeclared", 'over' = NA_character_,
'under' = NA_character_)))
# pivot the data to preference format
wyoming_l <- wyoming_cleaned %>%
pivot_longer(-BallotID, names_to = 'rank', values_to = "candidate", values_drop_na = TRUE) %>%
pivot_wider(names_from = candidate, values_from = rank, values_fn = first)
# relabel columns to enable pivoting
names<-colnames(AlaskaCVR)
names[7:11]<- c(1,2,3,4,5)
colnames(AlaskaCVR) <- names
AlaskaCVR$BallotID <- paste0(AlaskaCVR$`Tabulator Id`, AlaskaCVR$`Batch Id`, AlaskaCVR$`Record Id`)
# clean and relabel the candidates
alaska_cleaned <- AlaskaCVR %>%
dplyr::select(BallotID, `1` ,`2` ,`3` ,`4`, `5` ) %>%
group_by(BallotID) %>%
mutate_all(funs(recode(., '1' = "Biden", '3' = "Steyer",
'4' = "Sanders", '5' = "Klobuchar",
'7' = "Gabbard", '8' = "Warren",
'9' = "Bloomberg", '11' = "Buttgieg",
'12' = "undeclared", 'over' = NA_character_,
'under' = NA_character_)))
# pivot the data to preference format
alaska_l <- alaska_cleaned %>%
pivot_longer(-BallotID, names_to = 'rank', values_to = "candidate", values_drop_na = TRUE) %>%
pivot_wider(names_from = candidate, values_from = rank, values_fn = first)
# relabel columns to enable pivoting
names<-colnames(KansasCVR)
names[2:6]<- c(1,2,3,4,5)
colnames(KansasCVR) <- names
# clean and relabel the candidates
kansas_cleaned <- KansasCVR %>%
dplyr::select(BallotID, `1` ,`2` ,`3` ,`4`, `5` ) %>%
group_by(BallotID) %>%
mutate_all(funs(recode(., '1' = "Biden", '3' = "Steyer",
'4' = "Sanders", '5' = "Klobuchar",
'7' = "Gabbard", '8' = "Warren",
'9' = "Bloomberg", '11' = "Buttgieg",
'12' = "undeclared", 'over' = NA_character_,
'under' = NA_character_)))
# pivot the data to preference format
kansas_l <- kansas_cleaned %>%
pivot_longer(-BallotID, names_to = 'rank', values_to = "candidate", values_drop_na = TRUE) %>%
pivot_wider(names_from = candidate, values_from = rank, values_fn = first)
# To do: Losing observations between cleaned and long versions, why?
Having converted all the data to the same format, I label the rows by source, merge them into a single dataset, and run prcomp.
## Add a variable for state name
alaska_l$state<- 'alaska'
wyoming_l$state<- 'wyoming'
kansas_l$state<- 'kansas'
## Bind state dataframes together
all_l<- rbind(alaska_l, kansas_l, wyoming_l)
all_matrix <- all_l[,2:10] %>%
mapply(FUN=as.numeric)
all_cleaned <- rbind(alaska_cleaned, kansas_cleaned, wyoming_cleaned)
all_mat <- all_l[,2:10] %>%
mutate_all(~replace(., is.na(.), 6)) %>%
mapply(FUN=as.numeric)
principle_components_all <- all_mat%>%prcomp()
# Also failed to run with 128GB ram
# pref_scale_all <- all_matrix %>% prefscal(ndim=2,type="ordinal",conditionality="matrix")
all_pca <-bind_cols(all_l,as_tibble(principle_components_all$x)) #, as.tibble(pref_scale_all$conf.row))
ggplot(data = all_pca) +
geom_density(aes(x = PC1, fill = state),alpha = .4) +
theme_minimal()
Without clustering the data or labeling the candidates. This density plot already tells us a few things about the nature of these primaries. First of all, all of them were won by Biden, which suggested that the giant peaks at the left hand side of the graph are voters who ranked Biden and no one else. Second, given that Sanders came in second in all of these states and that Sanders supporters were more likely to rank other candidates, the peaks in the middle of the chart likely correspond to Sanders first Biden second or Biden second Sanders first on the center left and Sanders voters who ranked other candidates such as Warren second on the center right. The right tail of the distribution is likely composed of also rans. Do the clusters generated by knn confirm to these guesses?
First, I need to decide how many clusters to use. These cluster selection algorithms wound up crashing even after spinning up a 128 GB Ram AWS machine (screenshots attached) to run them, so I wound up selecting the number of clusters by sampling the data instead. 8 was used as a cutoff for number of clusters because part of the utility of the clusters for this application is for there to be fewer clusters than candidates.
## This takes way too long, even in AWS, wound up going with the backup plan of sampling the full dataset
library(factoextra)
fviz_nbclust(sample_n(as.data.frame(all_mat), 3000), FUN = kmeans, method = "wss", k.max = 8)
fviz_nbclust(sample_n(as.data.frame(all_mat), 3000), FUN = kmeans, method = "gap_stat", k.max = 8)
fviz_nbclust(sample_n(as.data.frame(all_mat), 3000), FUN = kmeans, method = "silhouette", k.max = 8)
All three cluster selection algorithms suggest that 6 is a good number of clusters to use, so that is what I went with.
Why do some of these tools not work even with 16 cores (well the cores do not help that much) and 128 GBs of RAM? Some of the errors I ran into indicated that they might have worked with 256 GB of RAM, but I did not want to risk spinning up an AWS instance that big for something that time consuming and have it not work. These tools take up so much memory by scanning a larger and larger possibility space trying to arrive at a specification where their error stat converges towards a global minima, so one possibility given how much RAM these were already trying to use is that such a minima does not exist for how I am setting them up.
all_cluster<-all_pca
# predict six clusters
all_kmeans <-kmeans(all_mat,centers = 6,# number of clusters
nstart = 100)# number of random starts
all_cluster$cluster<-all_kmeans$cluster
all_pca<-left_join(all_cluster, all_cleaned, by = 'BallotID' )
table1 <- all_pca%>%
group_by(`1`, cluster) %>%
summarize(count=n()) %>%
arrange(desc(count)) %>%
spread(`1`,count)
## `summarise()` regrouping output by '1' (override with `.groups` argument)
kable(table1)
| cluster | Biden | Bloomberg | Buttgieg | Gabbard | Klobuchar | Sanders | Steyer | undeclared | Warren | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 16789 | 53 | 181 | 91 | 113 | 21 | 20 | NA | 2549 | 54 |
| 2 | 45537 | 13 | 19 | 151 | 3 | 11497 | 2 | NA | 7043 | 8 |
| 3 | 42023 | 81 | 140 | 112 | 68 | 2 | 30 | NA | 1 | 62 |
| 4 | 3564 | 11 | 11 | 623 | 13 | 659 | 6 | 4448 | 84 | 147 |
| 5 | 14764 | 68 | 93 | 96 | 26 | 11191 | 33 | NA | 2 | 54 |
| 6 | 67 | 24 | 96 | 774 | 47 | 14595 | 20 | 37 | 3900 | 48 |
What can the cluster algorithm tell us about the underlying coalitions in the ballot data? 1) Cluster 1: Primarily composed of Biden-Warren voters, this makes up a relatively small chunk of Biden’s base, and a larger but still relatively small chunk of Warren’s base. Given that Warren supporters were percieved as ideologically similar to Sanders supporters but frequently clashed for other reasons, we can think of this cluster as Biden voters who liked Warren and Warren voters who disliked Sanders.
Cluster 2: This cluster is Biden-Sanders-Warren voters and makes up the bulk of Warren’s base and about a third of Sanders’ and Biden’s bases respectively. It is also the largest cluster overall. We can think of this cluster as ‘team player’ Democrats who liked all of the major candidates and just wanted them to get along.
Cluster 3: This cluster is clearly composed of Biden-only or Biden-minor-candidate voters. It is also the second largest cluster. Given the presence of Cluster 2, these voters are probably just Democrats who did not see the utility of ranking more candidates given that Biden was going to win, not Democrats who were deeply hostile to the rest of the field.
Cluster 4: This is a very small cluster composed mostly of voters for undeclared candidates. I do not know of any major write-in campaigns within the Democratic primary, but maybe Kanye got some voters, and the rest are ‘Biden is going to win, just going to write-in someone funny’ voters.
Cluster 5: This cluster is composed of Biden and Sanders but not Warren voters. These voters may have been motivated by sexism or disagreeements that occured between Sanders and Warren and Biden and Warren, or they may just have not seen the utility of ranking Warren when she was so far behind.
Cluster 6: In the same way that Cluster 3 is the Biden-only cluster, this one is very clearly the Sanders-only and Sanders-Warren cluster. Given that Sanders had very little hope of winning at this point. Voting only for Sanders or for Sanders and Warren was a clear signal of disatisfaction with the likely nomination of Biden.
ggplot(data = all_cluster) +
geom_point(aes(x = PC1, y =PC2, color = factor(cluster)),alpha = .4) +
theme_minimal()
ggplot(data = all_cluster) +
geom_density(aes(x = PC1, fill = factor(cluster)),alpha = .4) +
theme_minimal()
What does all of this tell us? Not a ton that is not obvious, but I came away from this clustering case study with a few takeaways.
While the utility of clustering as a tool for ranked data would be clearer if we were trying to compare across voting systems or had another dataset we wanted to merge onto (like census data or a precinct map), overall I think this exercise shows that the clusters would likely have interesting implications for such a study, and that the clusters are interesting in and of themselves.